πŸ•ΈοΈ Ada Research Browser

skilled-deep-research-postmortem.md
← Back

Skilled Deep Research - Post-Mortem

Written: 2026-03-07 | Run: cmmc-templates (CMMC 2.0 templates for small business)


TL;DR

The run produced 5 sources from 5 workers in ~39 minutes. That's 1 source per worker. A properly functioning run should produce 10-20 sources per worker (50-100 total). Actual yield: ~5-10% of expected. The skill architecture is sound but has multiple critical bugs.


What Actually Happened (Timeline)

| Time (UTC) | Event |
|---|---|
| 05:30 | Ada spawned orchestrator |
| 05:30-05:35 | Orchestrator spawned 5-6 workers in parallel |
| 05:30-05:45 | Workers each fetched 1-2 URLs, then went silent |
| 06:09 | Orchestrator declared "complete" and merged 5 sources |
| 06:09 | Report written with only 1 source per worker |

Bug #1 - CRITICAL: Workers can't spawn (agentId missing)

What happened: The resume orchestrator failed immediately with:

"error": "ACP target agent is not configured. Pass `agentId` in `sessions_spawn` or set `acp.defaultAgent` in config."

Root cause: The worker prompt template in SKILL.md calls sessions_spawn without an agentId. Sub-agents (depth ≥ 1) can't inherit the default agent from config; it must be passed explicitly.

Evidence: The resume orchestrator (c17c0e28) died after 8 lines: it spawned, tried to spawn workers, hit the ACP error, and stopped.

This is almost certainly why the original workers also failed: if the orchestrator spawned workers using the same broken prompt, every worker spawn would fail. The 5 results we did get may have come from the orchestrator itself fetching URLs directly (hallucinating worker output), or from a prompt variant that omitted the spawn call and fetched inline.

Fix: Add agentId to every sessions_spawn call in orchestrator and worker prompts:

sessions_spawn({
  agentId: "ada",  // ← REQUIRED for sub-agents
  task: "...",
  runtime: "subagent",
  ...
})

The agentId needs to be threaded from Ada → orchestrator prompt → worker prompts at spawn time.


Bug #2 - CRITICAL: known-urls.txt not being updated

What happened: After 5 workers fetched multiple URLs, known-urls.txt contained exactly 1 URL.

Expected: Every worker should append fetched URLs to known-urls.txt for deduplication.

Root cause: The worker prompt says to "Append URL to known-urls.txt" but doesn't give the exact file path or exec command. Workers are inconsistent about whether they do this step. Also: if workers are actually running in-process in the orchestrator (due to Bug #1), they may not have write access to the right path.

Impact: No deduplication across workers. Workers could fetch the same URLs. Retry logic is also broken since known-urls.txt is the source of truth.

Fix: Give workers an explicit shell command:

echo "https://fetched-url.com" >> /home/sean/.openclaw/workspace-ada/skills-data/skilled-deep-research/[SLUG]/known-urls.txt

And verify the file exists and is writable at worker startup.
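The append-plus-dedup step can be wrapped in a small helper that workers call for every URL. A minimal sketch, assuming workers can run Python; the record_url name is illustrative, not part of the current skill:

```python
from pathlib import Path

def record_url(known_urls_path: str, url: str) -> bool:
    """Append url to known-urls.txt if it is not already present.

    Returns True if the URL was new (and appended), False if it was
    already known. Creating the file up front also fails fast at
    worker startup if the directory is not writable.
    """
    path = Path(known_urls_path)
    path.touch(exist_ok=True)
    known = set(path.read_text().splitlines())
    if url in known:
        return False
    with path.open("a") as f:
        f.write(url + "\n")
    return True
```

Calling record_url before fetching gives workers dedup for free: a False return means another worker already handled that URL.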


Bug #3 - HIGH: Community worker result data loss

What happened: The community worker's progress.json showed:

- urls_fetched: 9
- findings: 9

But community-results.md contained only 1 source block.

Root cause: Workers are supposed to checkpoint (append to results.md) after every URL. The community worker either:

1. Buffered all results in memory and wrote only at the end (then crashed before writing), or
2. Overwrote instead of appending on a second pass.

Fix: Enforce append-only writes in the worker prompt with explicit shell:

cat >> results.md << 'BLOCK'
### [score] [title](url)
...
---
BLOCK

Never write the whole file at once. Checkpoint after every single URL fetch.


Bug #4 - HIGH: gov worker found 0 URLs

What happened: gov worker progress showed urls_found: 0, urls_fetched: 0. It was stuck on https://csrc.nist.gov/pubs/sp/800/171/a/final with no results.

Root cause: IPv6 blocking on .gov sites. The SKILL.md documents this explicitly:

"Our LXC uses IPv6 which Akamai CDNs can block on .gov/.mil sites. Never use raw web_fetch or curl without -4 on government sites."

The worker prompt tells workers to use the fetch script (which forces -4), but web_search results don't route through it automatically; the worker has to consciously call the fetch script for every URL. If a worker instead used web_fetch directly (the native tool), .gov fetches silently fail or return bot-block pages.

Evidence: The gov worker shows urls_found: 0, meaning even the search returned nothing actionable, or the worker couldn't parse the results before stalling.

Fix:

1. Add an explicit validation step at worker start: verify the fetch script exists and returns 200 for a test URL.
2. Add to the worker prompt: "DO NOT use the web_fetch tool for any URL. ONLY use the fetch script."
3. Consider a pre-flight search test before committing to the URL list.
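The startup validation could look like the sketch below, assuming the fetch path ultimately shells out to curl; ipv4_curl_cmd and preflight_ok are hypothetical helper names, not existing skill code:

```python
import subprocess

def ipv4_curl_cmd(url: str, timeout_s: int = 15) -> list[str]:
    """Build a curl command that forces IPv4 (-4) and prints only
    the HTTP status code, suitable for a worker-startup health check."""
    return [
        "curl", "-4",            # force IPv4: Akamai can block v6 on .gov/.mil
        "-s", "-o", "/dev/null",
        "-w", "%{http_code}",    # emit just the status code
        "--max-time", str(timeout_s),
        url,
    ]

def preflight_ok(url: str) -> bool:
    """Return True if the test URL answers 200 over IPv4."""
    out = subprocess.run(ipv4_curl_cmd(url), capture_output=True, text=True)
    return out.stdout.strip() == "200"
```

A worker that fails preflight_ok on a known-good .gov URL should abort and report, rather than silently fetching nothing.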


Bug #5 - MEDIUM: Binary file fetch (UnicodeDecodeError)

What happened: retry-queue.md contained:

- https://media.armis.com/raw/upload/cmmc-rfp-template.docx - reason: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf0

Root cause: The fetch script returns raw binary for .docx/.pdf files. Workers try to decode as UTF-8 text, fail, and log to retry queue. The retry worker would have the same problem.

Fix: Detect binary content types before fetching the full body. If the Content-Type is application/vnd.openxmlformats or application/pdf, just log the direct download URL; don't try to read the content. The existence of a direct download link is the finding, not the file contents.
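A minimal sketch of the content-type gate; the prefix list is an assumption about which types matter here, not an exhaustive registry:

```python
# Content types to log as direct-download findings rather than
# decode as text. The prefix match covers the long
# application/vnd.openxmlformats-officedocument.* variants.
BINARY_PREFIXES = (
    "application/pdf",
    "application/vnd.openxmlformats",
    "application/msword",
    "application/zip",
    "application/octet-stream",
)

def is_binary(content_type: str) -> bool:
    """True if a Content-Type header value indicates a binary payload.

    Strips any charset parameter and normalizes case before matching.
    """
    ct = content_type.split(";")[0].strip().lower()
    return ct.startswith(BINARY_PREFIXES)
```

Workers would check this against a HEAD response (or the first response headers) and, when it returns True, record the URL as a finding and skip the body entirely.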


Bug #6 - MEDIUM: Orchestrator declared complete too fast

What happened: The orchestrator merged results at 06:09, only ~39 minutes after workers spawned. The workers' progress files still showed phase: fetching at that point. The orchestrator didn't wait for completion signals.

Root cause: A2A signaling (workers → orchestrator) depends on sessions_send. If workers are broken (Bug #1), they never send WORKER_COMPLETE signals. The orchestrator's fallback is to poll progress files every 120s for up to 15 cycles (30 minutes max). After 15 cycles with no progress, it moves on, even if workers are stalled mid-fetch.

Impact: Orchestrator synthesized partial results and declared success.

Fix: Add a minimum threshold check before synthesis: "If fewer than 3 workers sent WORKER_COMPLETE and total findings < 10, do NOT synthesize; log a failure and alert Ada instead."
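The threshold rule reduces to a pure function the orchestrator calls before merging; a sketch with the thresholds from the rule above (names are illustrative):

```python
def should_synthesize(workers_complete: int, total_findings: int,
                      min_workers: int = 3, min_findings: int = 10) -> bool:
    """Gate synthesis on run health.

    Refuse to synthesize only when BOTH signals are bad: fewer than
    min_workers sent WORKER_COMPLETE AND total findings are below
    min_findings. Either signal alone being healthy lets the run pass.
    """
    if workers_complete < min_workers and total_findings < min_findings:
        return False
    return True
```

Under this gate, the cmmc-templates run (roughly 0-1 completed workers, 5 findings) would have been flagged as a failure instead of a false success.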


Bug #7 - LOW: merge-reports.py parses results with brittle regex

What happened: The report's source quality scores are all listed as [2/5] despite the underlying worker results showing [5/5] for the NIST templates. The merge script likely failed to parse the format correctly.

Root cause: merge-reports.py uses regex on the results markdown. Any formatting deviation (missing blank line, slightly different header) causes the parser to drop or misparse a source.

Fix: Switch to a more forgiving parser, or enforce results format with a schema validator workers run before writing. At minimum, add a format-check step to the worker prompt.
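One way to make the parser more forgiving: split on the --- separators and match headers with a whitespace-tolerant pattern. A sketch, assuming the block format from Bug #3 (a ### [score/5] [title](url) header per source):

```python
import re

# Matches headers like "### [5/5] [NIST templates](https://...)",
# tolerating extra spaces and 2-4 leading '#' characters.
HEADER_RE = re.compile(
    r"^#{2,4}\s*\[\s*(?P<score>\d)\s*/\s*5\s*\]\s*"
    r"\[(?P<title>[^\]]+)\]\((?P<url>[^)]+)\)",
    re.MULTILINE,
)

def parse_results(markdown: str) -> list[dict]:
    """Extract (score, title, url) from a worker results file,
    treating '---' lines as soft block separators."""
    sources = []
    for block in re.split(r"^-{3,}\s*$", markdown, flags=re.MULTILINE):
        m = HEADER_RE.search(block)
        if m:
            sources.append({
                "score": int(m.group("score")),
                "title": m.group("title").strip(),
                "url": m.group("url").strip(),
            })
    return sources
```

This still drops a source whose header deviates wildly, which is why a worker-side format check before writing remains the stronger fix.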


What Actually Worked


Missing Capability: Site Crawling

Not a bug, but a genuine missing feature. The current skill:

- Fetches individual URLs surfaced by search queries
- Does NOT follow links or traverse site structure

For deep research, this is a significant gap. Sites like cmmcaudit.org have a resource index page that links to 8+ template pages. Search engines only surface 1-2 of those. Without crawling, we miss 75%+ of available resources on resource-rich sites.

Proposed solution: A crawl.py helper script:

# Given a root URL + relevance keywords, extract and score internal links
# Return top N links sorted by relevance score (anchor text match)
# Respect: depth limit (2), domain boundary, already-known URLs

Workers call this when they land on a page that looks like a resource index (template, download, tools pages).
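A sketch of the link-scoring core of such a crawl.py, using only the standard library; scoring anchor text against relevance keywords is the assumption here, with depth handling left to the caller:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkScorer(HTMLParser):
    """Collect <a href> links with their anchor text."""
    def __init__(self):
        super().__init__()
        self.links = []          # list of (href, anchor_text)
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href:
            self.links.append((self._href, " ".join(self._text).strip()))
            self._href = None

def score_links(html, base_url, keywords, known, top_n=10):
    """Return up to top_n same-domain links ranked by how many
    relevance keywords appear in the anchor text, skipping known URLs."""
    parser = LinkScorer()
    parser.feed(html)
    domain = urlparse(base_url).netloc
    scored = []
    for href, text in parser.links:
        url = urljoin(base_url, href)
        if urlparse(url).netloc != domain or url in known:
            continue
        score = sum(1 for kw in keywords if kw.lower() in text.lower())
        if score > 0:
            scored.append((score, url, text))
    scored.sort(key=lambda t: -t[0])
    return scored[:top_n]
```

Feeding a resource-index page through score_links with keywords like "template" and "cmmc" would surface exactly the internal links search engines miss, already deduplicated against known-urls.txt.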


Priority Fix List

| Priority | Bug | Effort | Impact |
|---|---|---|---|
| 🔴 P0 | Bug #1: agentId missing from sessions_spawn | Low (add one field) | Workers can't spawn at all |
| 🔴 P0 | Bug #3: Result data loss (buffer vs append) | Low (change write pattern) | 80%+ of findings lost |
| 🔴 P0 | Bug #4: gov worker IPv6 block | Low (strengthen prompt) | .gov sources completely inaccessible |
| 🟠 P1 | Bug #2: known-urls.txt not updated | Low (add explicit command) | Dedup broken, retry logic broken |
| 🟠 P1 | Bug #6: Orchestrator declares complete too fast | Medium (add threshold check) | False "success" on failed runs |
| 🟡 P2 | Bug #5: Binary file UnicodeDecodeError | Low (content-type check) | Direct download links missed |
| 🟡 P2 | Bug #7: merge-reports.py brittle parser | Medium (improve parser) | Source scores wrong in final report |
| 🟢 P3 | Missing: Site crawling | High (new script + prompt changes) | 10x more sources on resource-rich sites |

  1. Fix agentId (P0): without this, nothing works
  2. Fix append-only results writing (P0): without this, findings are lost
  3. Fix IPv6 / fetch script enforcement (P0): .gov sources are highest quality
  4. Fix known-urls.txt update (P1): enables proper dedup and retry
  5. Fix orchestrator completion threshold (P1): prevents false success
  6. Fix binary file handling (P2)
  7. Fix merge parser (P2)
  8. Build crawl capability (P3)